@liseli
Contributor

@liseli liseli commented Dec 12, 2025

Issue: Catalog search fails on certain input queries because Solr special characters are not escaped.

The main goal of this task is to fix the search algorithm to prevent query parser errors and injection.

Ticket: ETT-1200

As part of this PR:

  • A thorough review of the Catalog search has been completed.
  • Unit tests have been created.
  • Different solutions have been tested with the goal of finding the smallest change that would have the least impact on the application.
  • A Confluence page with a detailed explanation of the issue has been written.

The search algorithm in production consists of:

  • Remove some special characters
  • Fix the input query; some input queries are rejected, in which case the application falls back to the query *:*.
  • Validate
  • Tokenize
  • Create the Solr query
  • Some inputs that production treats as valid but then fails on: ~, \, table~~2, ~~~///

What changes have been implemented in the current PR?

  • Remove some special characters
  • Validate and reject invalid queries; for a rejected query, the application falls back to the query *:*.
    • Refactor the function validateInput, adding rules that identify invalid queries before they are sent to the Solr server.
    • All of these inputs are invalid: ~, \, table~~2, ~~~///
  • Tokenize
  • Escape special characters
    • Create a set of functions to escape the different kinds of syntax present when the q field is created.
    • Create a function to escape special characters when the fq field is created.
  • Create the Solr query
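
To illustrate the "Escape special characters" step above, here is a minimal sketch of backslash-escaping the Lucene/Solr query-parser special characters. The function name escape_solr_term_sketch is hypothetical and is not one of the functions added in this PR; the real code applies different rules to each kind of syntax in the q field, which this sketch does not attempt.

```php
<?php
// Hypothetical helper for illustration only; NOT the escaping code from this PR.
// Prefixes every Lucene/Solr query-parser special character with a backslash
// so that it is matched literally instead of being parsed as query syntax.
function escape_solr_term_sketch(string $term): string
{
    // Characters escaped: + - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
    return preg_replace('/([+\-&|!(){}\[\]^"~*?:\\\\\/])/', '\\\\$1', $term);
}

// Example: the colon and tilde lose their special meaning.
echo escape_solr_term_sketch('author:smith'), "\n"; // prints author\:smith
echo escape_solr_term_sketch('table~2'), "\n";      // prints table\~2
```

Escaping every special character is the bluntest option: it also disables syntax such as wildcards, which may or may not be desirable depending on what the search box is meant to support.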

How to test:

docker compose build
docker compose up -d

Next steps:

  • Try testing the application to identify any related issues.
  • Take a moment to compare the production version with the current output to see what's different.
  • Also, consider testing the Catalog application locally using various input queries to ensure everything works smoothly.
  • The function lucene_escape should be replaced by lucene_escape_fq.

@liseli liseli requested review from aelkiss and moseshll December 15, 2025 19:26
Contributor

@moseshll moseshll left a comment

This works as expected. See the original ticket for a follow-up about the display of this previously-broken facet now that we can select it without an error -- the quotes are missing when the facet is in the "Current Filters". May be out of scope and it's just cosmetic. APPROVE

EDIT: I just put a screenshot on the ticket.

Member

@aelkiss aelkiss left a comment

The main question I have is if we should be doing something more general for escaping queries. In particular it doesn't seem like this will address the issues with characters like ~ and \. Given that these don't cause issues in ls, I think it's worth seeing if we can apply the more general escaping strategy.

@liseli liseli force-pushed the ETT-1200_facetSearchError branch from bc9f577 to 7f4f26d Compare January 13, 2026 23:00
@aelkiss aelkiss requested review from aelkiss and moseshll January 27, 2026 14:59
@liseli liseli force-pushed the ETT-1200_facetSearchError branch from 4c3f1d5 to e8fcf91 Compare January 27, 2026 15:51
…e removed or escaped to avoid Solr error parser or field injection issue.

Adding unit tests for all the functions involved in creating the Solr query, including tests for the functions build_and_or_onephrase and tokenizeInput.
Refactoring the validateInput function, cleaning up build_and_or_onephrase, and applying escaping when creating the q and fq Solr fields.
Removing illegal characters if the query input is invalid, otherwise escaping the input query.
Redefining the logic to escape the q field, applying a different escaping rule to each semantic representation.
Detecting different kinds of unbalanced quotes when the SearchStructure is created.
Removing statement on smartquote_problem_talking_catalog test for testing.
Fixing the Playwright test to check unbalanced fancy quotes.
Integrate the normalization of the fields lcnormalized and stdnum.
Document the logic to escape Solr input queries.
Handle multiple issues in invalid input queries.
@liseli liseli force-pushed the ETT-1200_facetSearchError branch from e8fcf91 to 490cacf Compare January 27, 2026 16:21
Member

@aelkiss aelkiss left a comment

I tried locally with a variety of malformed inputs and wasn't able to get any errors from Solr. I also verified that quotes in facets work now.

We should probably remove the remaining error_log calls that were added; also see the comments on some of the tests.

Member

I'm unsure whether it makes sense to test the behavior with e.g. ~2 here, since that isn't syntax we claim to support. It isn't necessarily a problem to pass it through to Solr, but I'm not sure why so many of the test cases here have it.

$tokens = $this->solr->tokenizeInput('table AND "chair leg"~2');

$this->assertSame(
['table AND "chair leg"~2'],
Member

I'm not sure why this is the behavior we want, although again, it doesn't really matter as we don't claim to support boolean operators in search strings.

Contributor Author

The logic to tokenize input queries is based on a regular expression. I created some tests to check the expected behavior. For this specific test, I wanted to confirm that tokenizeInput() captures phrases containing boolean expressions without splitting them. The function also keeps intact expressions such as proximity searches like "hello world"~5 and exact phrases (double-quoted strings).

public function testAcceptsValidFieldedQueries(): void
{
$valid = [
'title:table',
Member

Likewise, this isn't syntax we claim to support, and it won't work as expected.

For example http://localhost:8080/Search/Home?lookfor=author%3Achaucer&searchtype=all returns no results; you need to specifically search by author: http://localhost:8080/Search/Home?lookfor=chaucer&searchtype=author

I don't think we need to reject these kinds of queries outright (and we definitely don't want to get a Solr error), but the test is maybe misleading in that it suggests to a reader that these queries would behave as expected.

Contributor Author

I'll add a comment to this test to clarify that it checks whether an input query is valid. It's not about the types of queries we support for search.

For example, suppose the user selects the author search type and types the query 'author:smith' (an analogous title search: http://localhost:8080/Search/Home?lookfor=title%3Atable&searchtype=title).

Our search algorithm will understand that the field used for the query is author and that the query string is author:smith. This query is not rejected, but the result is empty because author:smith does not exist in our index.

sys/Solr.php Outdated
$args = array_merge($args, $this->spellcheckComponents($ss));
}

error_log("Action before simplesearch: " . $action);
Member

Leftover debugging stuff -- should remove

sys/Solr.php Outdated
die();
}

error_log("Solr action used: " . $action);
Member

Another extra logging call we probably don't want.

* - ' "table" ' → table
* - '"table name"' → table name
* - 'table "name"' → table "name" (unchanged)
* - '"table"name"' → "table"name" (unchanged)
Contributor

This comment appears to be at odds with the behavior of the code. I would expect "table"name" to become table"name.

I also recommend adding a test for this method. Here's one:

    /* ============================================================
     * remove_quotes()
     * ============================================================
    */

    /**
    * @covers Solr::remove_quotes
    */
    public function testRemoveQuotes(): void
    {
        $this->assertSame(
            'table',
            $this->solr->remove_quotes('"table"')
        );

        $this->assertSame(
            'table',
            $this->solr->remove_quotes('  "table" ')
        );

        $this->assertSame(
            'table name',
            $this->solr->remove_quotes('"table name"')
        );

        $this->assertSame(
            'table "name"',
            $this->solr->remove_quotes('table "name"')
        );

        $this->assertSame(
            'table"name',
            $this->solr->remove_quotes('"table"name"')
        );
    }
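
For reference, a minimal sketch of a function that would satisfy the test above. This is an assumption about the intended behavior (trim, then strip one enclosing pair of double quotes), written as a free function under the hypothetical name remove_quotes_sketch; it is not the code in sys/Solr.php.

```php
<?php
// Hypothetical stand-in for Solr::remove_quotes, written only to match the
// assertions in the test above; the real method may differ.
function remove_quotes_sketch(string $input): string
{
    $s = trim($input);

    // Strip exactly one pair of enclosing double quotes, if both are present.
    // Quotes in the middle of the string are left untouched.
    if (strlen($s) >= 2 && $s[0] === '"' && substr($s, -1) === '"') {
        return substr($s, 1, -1);
    }

    return $s;
}
```

Under this reading, '"table"name"' becomes table"name because only the outermost pair is removed, which is the behavior the docblock would need to describe.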
